A Probabilistic Model To Support Web Crawling For Social Media Based Sites
Author
Abstract
Web crawling is a popular data-collection technique among researchers, practitioners and scholars. One of the major challenges web crawling techniques often face is inefficient utilization of bandwidth, which results in an unbalanced crawl across different sites. In this paper, we describe a probabilistic model that improves the crawling distribution of the overall sample. The model is designed with both small and large social media based sites in mind, and it aims to achieve efficient utilization of bandwidth and greater coverage of the crawled sample. An initial pilot study shows promising results.
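The abstract does not spell out the model itself, so the sketch below only illustrates, in Python, one way a probabilistic scheduler could spread a crawl budget across sites of very different sizes. The `Site` fields, the square-root weighting, and the `pick_next_site` helper are illustrative assumptions for this sketch, not the model proposed in the paper.

```python
import math
import random
from dataclasses import dataclass

# Illustrative sketch only: the weighting scheme below is an assumption,
# not the probabilistic model proposed in the paper.

@dataclass
class Site:
    name: str
    estimated_pages: int    # rough size of the site
    crawled_pages: int = 0  # pages fetched so far

def selection_weight(site: Site) -> float:
    """Favor sites with many uncrawled pages, but damp the raw size with a
    square root so large sites do not monopolize the bandwidth."""
    remaining = max(site.estimated_pages - site.crawled_pages, 0)
    return math.sqrt(remaining)

def pick_next_site(sites: list[Site]) -> Site:
    """Sample a site with probability proportional to its weight,
    so small sites still receive a share of the crawl budget."""
    weights = [selection_weight(s) for s in sites]
    return random.choices(sites, weights=weights, k=1)[0]

if __name__ == "__main__":
    sites = [Site("large-network", 1_000_000), Site("niche-forum", 5_000)]
    for _ in range(10):
        chosen = pick_next_site(sites)
        chosen.crawled_pages += 1  # stand-in for fetching one page
        print(chosen.name)
```

The damped weight is only meant to show the trade-off the abstract describes: purely proportional sampling lets large sites dominate the bandwidth, while a damped weight keeps small social media sites represented in the sample.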
Similar Resources
On Social Network Web Sites: Definition, Features, Architectures and Analysis Tools
Development and usage of online social networking web sites are growing rapidly. Millions of members of these web sites publicly articulate mutual "friendship" relations and share user-created content, such as photos, videos, files, and blogs. Advances in web design technology and the fast-growing usage of online resources have prompted web designers to improve the features and architectures of social ...
A density based clustering approach to distinguish between web robot and human requests to a web server
Today, the world's dependence on the Internet and the emergence of Web 2.0 applications are significantly increasing the need for web robots that crawl sites to support services and technologies. Regardless of the advantages of robots, they may occupy bandwidth and reduce the performance of web servers. Despite a variety of research efforts, there is no accurate method for classifying huge data ...
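As a rough illustration of the density-based idea named in that title, the sketch below clusters per-session request features with scikit-learn's DBSCAN; the feature choices, `eps`, and `min_samples` are assumptions made for this example, not parameters reported in that study.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative only: features and DBSCAN parameters are assumptions,
# not taken from the cited study.
# Each row describes one session: [requests per minute, fraction of HTML pages,
# fraction of requests hitting robots.txt].
sessions = np.array([
    [2.0, 0.90, 0.00],    # human-like browsing
    [1.5, 0.85, 0.00],
    [60.0, 0.20, 0.05],   # burst of automated requests
    [55.0, 0.25, 0.04],
    [300.0, 0.05, 0.10],  # aggressive crawler
])

# Dense regions become clusters; sparse outliers are labeled -1 (noise),
# which is one way unusual (possibly robotic) sessions can stand out.
labels = DBSCAN(eps=10.0, min_samples=2).fit_predict(sessions)
print(labels)
```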
The iCrawl System for Focused and Integrated Web Archive Crawling
The large size of the Web makes it infeasible for many institutions to collect, store and process archives of the entire Web. Instead, many institutions focus on creating archives of specific subsets of the Web. These subsets may be based around specific topics or events. Our iCrawl system provides a focused crawler that is able to automatically collect Web pages relevant to a topic based on co...
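To make the focused-crawling idea concrete, here is a minimal, generic sketch of a crawl frontier that prioritizes URLs by a topic-relevance score. The keyword-overlap scoring and the `Frontier` class are placeholder illustrations and do not describe iCrawl's actual components.

```python
import heapq

# Generic sketch of a focused-crawl frontier; the scoring function is a
# placeholder and does not reflect iCrawl's actual relevance model.

TOPIC_TERMS = {"crawl", "archive", "web"}

def relevance(anchor_text: str) -> float:
    """Toy relevance score: fraction of topic terms found in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)

class Frontier:
    """Priority queue that pops the most topically relevant URL first."""
    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def add(self, url: str, anchor_text: str) -> None:
        # heapq is a min-heap, so push the negated score.
        heapq.heappush(self._heap, (-relevance(anchor_text), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

frontier = Frontier()
frontier.add("http://example.org/cats", "cute cat pictures")
frontier.add("http://example.org/archiving", "web archive crawl strategies")
print(frontier.pop())  # the archiving page is fetched first
```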
ARCOMEM Crawling Architecture
The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limita...